1. INTRODUCTION

In January of 1965, a program was organized within the
Language Analysis and Translation Section of the Research
Laboratory of The Bunker-Ramo Corporation for the investigation
of several aspects of natural language data processing by formalized
methods. As originally conceived, this program was to embrace
a number of separate but related projects, each devoted to a
different aspect of the overall problem. Eventually, the program
was to involve sentence analysis and generation by various methods
based upon context-free, context-sensitive, free-rewrite, and
transformational linguistic systems.

The first project under this program was dedicated to the 
development of a context-free recognition grammar for Russian. 
Work on this project, which has already resulted in a grammar 
of impressive scope and power, is still in progress; and since 
our interest in this project has to date precluded the development 
of other projects under this program, the remainder of this report 
will be concerned almost exclusively with the technique that has 
been developed for parsing Russian sentences with a context-free 
grammar, and with the grammar itself.



1.1 RESEARCH POLICY



From the very inception of this program, it has been our
policy to subject ourselves to the rigorous discipline of operational
demonstrability in all phases of the work in progress. That is to
say, every hypothesis concerning the organization of the
processing algorithms and the structure and context of the grammar
has been tested as a running computational procedure before
gaining full acceptance as a working principle. This general
policy has been observed with respect to all levels of detail of
the work, from the broadest to the most specific, involving at
the lowest level the testing of individual grammar rules.

The adoption of this policy required that a very short research
cycle (design, test, evaluate) be maintained, in which the temporal
distance separating the linguistic research group from the computer
center is minimized. The latter goal has been achieved by carrying
out the greater part of the project work in the computer center,
the linguistic research personnel performing all of the required
programming and equipment operation tasks themselves (generally
during second and third shift periods). The uncommon luxury of
such "hands-on" working conditions has been made possible by
the in-house availability of a corporate-utility research computer
facility. Many of the features of the linguistic processing system
developed under this project have been designed specifically to
exploit this advantage to the fullest; in a batch-processing or
production environment much of the input/output and peripheral
utility programming would have taken on a markedly different
form.



1.2 THE RESEARCH COMPUTER CENTER






1.2.1 Hardware 



Another source of program design constraints has been, 
quite naturally, the hardware available in the computer facility. 
This consists of the following: 



o  one Bunker-Ramo Model 130 (AN/UYK-1) computer;
   this is an 8K, 15-bit, parallel binary, stored
   micrologic machine with a basic read-generate
   cycle time of 6 usec.

o  8K cells of directly-addressable extended memory

o  one 120-character line printer (BR-282)

o  one magnetic tape controller (BR-192) and
   two transports (BR-170)

o  one input/output controller (BR-143), with
   associated typewriter, paper-tape reader
   and punch, and card reader

Many of the details of organization of the operating programs,
and even of the grammar, have been chosen with this hardware
system's limited memory and intermediate running speed (as a
result of the micrologic programming feature) in mind.



1.2.2 Software 



A third influence on the present form of the algorithm and 
utility programs has been the availability of a version of the 
FORTRAN IV programming language in the computer center's 
software library. The experimental nature of the projects to
be undertaken made it advisable to sacrifice the efficiency of
machine-language programming, at the outset, in favor of the
ease of programming and debugging with a procedure-oriented
language; while a symbol or string processing language (such
as COMIT) might have been preferable for this purpose, the
ready availability of FORTRAN, and the familiarity of the
research group with its use, combined to bring about its adoption
for the initial development of the language data processing
algorithms and utility programs. As a consequence, some of
the characteristics of these algorithms derive ultimately from
their initial expression in an arithmetic-oriented programming
language.

The reader will find many of the details of the exposition
that follows easier to understand if these three sources of design
limitations are taken into account.
2. THE CONTEXT-FREE PARSING PROJECT

The program for the study of formalized language data
processing systems began with the project for the parsing of
Russian sentences with a context-free algorithm and grammar.
Several motives contributed to the decision to initiate such a
project. One of the strongest of these was the academic interest,
on the part of some of the members of this department, in formal
linguistic systems. Another, related to this but of a more utilitarian
turn, was a desire to determine the usefulness of formalized
techniques in practical language data processing applications
such as machine translation, information retrieval, etc. Moreover,
at the time of its inception, this project was relevant to departmental
contractual obligations.

The second of these motives deserves further elaboration. 

There are several characteristics of formalized linguistic
procedures which seemed, at least superficially, to offer substantial
advantages in practical language data processing operations. The
much discussed separation of grammar and algorithm is one such 
characteristic which promised particular benefits for applications 
systems subject to continuing modification through research. 

Another apparent advantage of formalized linguistic procedures 
which was of interest to the research group is the requirement 
of formal "homogeneity" in the grammatical specifications (or 
rules) . A third such advantage is the relative simplicity of 
programs embodying the basic processing algorithms. The 
fundamental justification for our inception and continuation of 
this research program has been the testing of these (and other) 
apparent advantages for practical language data processing 
applications; some conclusions concerning these topics, derived 
from the context-free parsing project, are presented later in this 
report. 
2.1 THE CONTEXT-FREE PARSING SYSTEM

From the beginning of this project to the time of this writing,
the overall context-free parsing system has been undergoing a
continuous evolutionary development. But for a need to save
time, space, and the reader's patience, a historical description
of the system would be the most natural kind of treatment. It
will be more practical, however, to write only a few words about
the system's humble beginnings and then to describe in limited
detail its present state.

Originally, it was decided that the relatively simple and
efficient bottom-to-top direct-substitution procedure, working
with binary phrase-structure rules, would serve our needs very
well. The first task performed under this project was, accordingly,
the programming, in FORTRAN, of such an algorithm and
associated input/output subroutines. This first system was, in
all respects, rather primitive. It provided for a grammar of
quite modest proportions (250 rules, each of an ordered triple
of integers ranging from 000 to 999) on which a crude sequential
search was performed during look-up operations. A very limited
kind of output was provided. Input capacity was restricted to
short strings with a few grammar codes per item. Program flow
was relatively inefficient, and it tended to run slowly. These
deficiencies have since been remedied, one by one; and an
efficient and quite powerful sentence analysis system has evolved.

In the course of this development, significant changes have
been made in the grammar and algorithm of the system. The
most important of these changes has been the introduction of
grammatical variables as components of grammatical labels in
order to enhance the discriminatory power of the system. In his
latest publication on linguistic theory [1], Noam Chomsky argues,
in effect, that this modification removes the system from the class
of phrase-structure systems to the class of transformational
systems. Chomsky's argument on this point is not entirely
convincing, however, particularly in view of the radical distinctions
between the formal properties of this system (and others like it)
and those of transformational systems as discussed in the rather
ample literature on the subject. Nevertheless, to avoid the
possible accusation of abusing the term "phrase-structure", the
present system will be referred to as a vector-symbol phrase
grammar system; this name is suggested by the modified form
of the system's grammatical labels, which consist of ordered
triples of symbols.

[1] See Aspects of the Theory of Syntax, M.I.T. Press,
Cambridge, 1965, pp. 67, 89, 90, 210, 211, and 215. For purposes
of consistency, our use of the term "phrase-structure" in this report
has been limited to accord with Chomsky's interpretation of it.

To simplify the presentation of the system's details, a full 
discussion of the form and function of the grammatical variables 
will be given after the operations of the basic (phrase-structure)
parsing algorithm have been presented. It will then be shown in 
precisely what fashion the grammatical variables restrict these 
operations. This order of presentation is intended not only to 
simplify matters for the reader but also to reflect the evolutionary 
history of the system itself. 

2.1.1 The Algorithm 

2.1.1.1 Description

The structure of the basic parsing algorithm [2] is very simple,
yet difficult to describe clearly in plain English. For this reason,
the following brief verbal description is supplemented with a gross
schematic flowchart on pages 10 and 11.

[1] The vector-symbol phrase grammar system is abbreviated VSPG.

[2] Of the kind known widely as the "Cocke algorithm" after
John Cocke of IBM. We are indebted to Martin Kay of the RAND
Corporation for most helpful discussions and advice at the outset
of this project.

This algorithm performs elementary list-processing operations
on a group of four interrelated linear arrays representing the
developing tree structure(s) assigned to the input string. If these
arrays are regarded as a single two-dimensional array, then we
may say that each row of the array represents a single node of
one or more of the trees; the columns of the array specify, for
each row, (1) the position in the input string, counting from left,
of the leftmost terminal dominated by the node corresponding
to the row, (2) the grammatical label corresponding to the node,
(3) the number of the row corresponding to the left-hand daughter
of the node, (4) the number of the row corresponding to the
right-hand daughter of the node. A set of counters is maintained
as pointers during operations on the rows, while the columns are
addressed by name. A compact table is kept of the row numbers
of the earliest and latest nodes dominating substrings of a given
length.
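
The array layout just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `ParseList`, the method `add_row`, and the use of `None` for the daughters of terminal rows are hypothetical, and the `span` dictionary stands in for the compact table of earliest and latest rows per substring length. It relies on rows of a given length being added consecutively, which is how the algorithm described here proceeds.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ParseList:
    """Four parallel arrays, one entry per row (node). Names are illustrative."""
    inw: List[int] = field(default_factory=list)            # position of leftmost terminal
    gl: List[int] = field(default_factory=list)             # grammatical label
    lc: List[Optional[int]] = field(default_factory=list)   # row of left-hand daughter
    rc: List[Optional[int]] = field(default_factory=list)   # row of right-hand daughter
    # substring length -> (earliest row, latest row) dominating that length
    span: Dict[int, Tuple[int, int]] = field(default_factory=dict)

    def add_row(self, inw: int, gl: int, lc: Optional[int],
                rc: Optional[int], length: int) -> int:
        """Append one node and update the per-length table; return its row number."""
        row = len(self.gl)
        self.inw.append(inw)
        self.gl.append(gl)
        self.lc.append(lc)
        self.rc.append(rc)
        earliest = self.span.get(length, (row, row))[0]
        self.span[length] = (earliest, row)
        return row
```

Because rows of equal length are contiguous, a pair of row numbers per length is enough to enumerate every candidate daughter of that length, which is what makes the table compact.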

As the algorithm runs, rows corresponding to progressively 
longer well-formed substrings are added to the initial array. 

The procedure begins with attempts to combine, as left-hand
and right-hand daughters of a single node, rows dominating
substrings of length 1 (i.e., single input items). Such a
combination can be made if and only if the substrings dominated
by the daughter-candidate rows are adjacent and the grammar
contains one or more rules providing for the combining of the
daughter-candidate rows' grammatical labels. If both conditions
are met in a given instance, then a new row is added to the array
for each applicable grammar rule. When all the possibilities for
forming substrings of length 2 (by combining substrings of length 1)
are exhausted, a new series of attempts is made to form
substrings of length 3 (combining, in both orders, substrings of
length 1 with substrings of length 2). Next, the formation of
substrings of length 4 is attempted, requiring exhaustive testing
of substrings of length 1 and 3, 2 and 2, and 3 and 1. And so
on, until no further combinations are possible. When this
condition is met, the latest row(s) entered in the basic array
will represent the longest well-formed substrings in the input
string; if all has gone well, the latest row(s) will correspond
to nodes dominating the entire input string and will carry the
grammatical label "Sentence". The flowcharts and explanatory
notes, on the next three pages, show the structure of the basic
algorithm in some detail.
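
The procedure described above can be sketched in Python as follows. This is a minimal sketch, not the FORTRAN or machine-language program of the report: the function name `cocke_parse`, the dictionary form of the grammar, and the string labels in the usage example are all assumptions made for illustration. The array names follow the flowchart symbols (INW, GL, LC, RC, LENGTH).

```python
from collections import defaultdict


def cocke_parse(items, rules):
    """Bottom-to-top parse with binary rules.

    items: one list of grammatical labels per input item.
    rules: dict mapping (left label, right label) -> list of parent labels.
    Returns the parallel arrays INW, GL, LC, RC, LENGTH.
    """
    INW, GL, LC, RC, LENGTH = [], [], [], [], []
    rows_by_length = defaultdict(list)  # stand-in for the compact length table

    def add_row(inw, gl, lc, rc, length):
        rows_by_length[length].append(len(GL))
        INW.append(inw); GL.append(gl); LC.append(lc)
        RC.append(rc); LENGTH.append(length)

    # Rows of length 1: one per grammatical label of each input item.
    for pos, labels in enumerate(items, start=1):
        for label in labels:
            add_row(pos, label, None, None, 1)

    # Form progressively longer substrings of phrase length pl.
    for pl in range(2, len(items) + 1):
        for lsize in range(1, pl):
            rsize = pl - lsize
            for i in rows_by_length[lsize]:
                for j in rows_by_length[rsize]:
                    if INW[j] != INW[i] + lsize:
                        continue  # the two substrings are not adjacent
                    for parent in rules.get((GL[i], GL[j]), []):
                        add_row(INW[i], parent, i, j, pl)
    return INW, GL, LC, RC, LENGTH


# Usage with a toy grammar (labels are hypothetical, not from the report):
toy_rules = {("A", "N"): ["NP"], ("NP", "V"): ["Sentence"]}
INW, GL, LC, RC, LENGTH = cocke_parse([["A"], ["N"], ["V"]], toy_rules)
```

The sketch runs through every phrase length up to the length of the input rather than stopping as soon as no further combinations are possible; the rows produced are the same.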

This algorithm has recently been reprogrammed in machine
language for the Bunker-Ramo 130 computer, resulting in a very
substantial gain in operating speed over that obtained with the
FORTRAN version. All input/output subroutines are still
expressed in FORTRAN, however, simply for the sake of
convenience of modification if and when the need arises.

2.1.1.2 Flowcharts of the Basic Parsing Algorithm

2.1.1.2.1 Explanation of Symbols Used in Flowcharts
on Following Pages. In the following basic system flowcharts,
the four parsing list arrays are referred to as follows:

INW_i     position in the input string, counting from
          the left, of the leftmost terminal dominated
          by the node corresponding to row i

GL_i      grammatical label of the node corresponding
          to row i

LC_i      row number of the left daughter of the
          node corresponding to row i

RC_i      row number of the right daughter of the
          node corresponding to row i

[...] number counters and pointers (lower-case when [...]):

[...]     counter for items in the input string

[...]     counter for rows of the parsing list arrays
          (or nodes)

[...]     counter for input grammatical labels in
          INPUT subprogram

[...]     pointer for earliest right-hand daughter-
          candidate

K, k      pointer for input grammatical labels

I, i      pointer for left-hand daughter-candidates

J, j      pointer for right-hand daughter-candidate

Other symbols are:

IGL_i     the i-th input grammatical label for a
          given input item

PL        phrase length; the length of substrings
          currently being formed

LSIZE     the length of the left-hand component of
          the substring under test

RSIZE     the length of the right-hand component
          of the substring under test

LENGTH_i  the length of the substring dominated by
          the node corresponding to row i

A, B, C   components of grammar rules of the form
          A + B → C

Two flow diagrams [...] show how the input
data are [...] parsing array. Box (3) of the
[...] a single punched card, [...]
PARSING SUBPROGRAM FLOWCHART








2.1.1.3 Modification of the Algorithm

As has been mentioned, substantial modifications have been
introduced into the system. The two most important of these are
described in this section. The first, which restricts the strong
generative capacity of the system, was motivated by the observation
that a context-free phrase-structure system whose grammar
contains labels that are both left and right recursive will produce
practically irrelevant ambiguous structural descriptions. The
second, and more fundamental, of these modifications introduced
vector-symbols into the grammar as phrase markers and required
a corresponding supplementation of the algorithm to manipulate
them. The first modification affected only the algorithm; the
second, since it has affected both grammar and algorithm, is
discussed below with respect to its effects on the parsing strategy
and in a later section (2.1.2) with respect to its effects on the
grammar itself.

2.1.1.3.1 The Node Suppression Option. The first of these
special features, the node suppression option, has been adopted
to deal with the tendency of context-free parsing systems of the
present kind to derive trivially different structural descriptions
for even relatively simple input strings under certain conditions.
This may occur only when a particular grammatical label is both
left and right recursive in the grammar. [1] Let us consider four
ways in which this condition may occur:

1. A single recursive rule of the form A + A → A makes
the grammatical label A both left and right recursive. Strings
of the form A + A + A ... will be analyzed ambiguously
with this rule.

2. The two recursive rules (1) A + B → A, and
(2) C + A → A make the label A left recursive and right recursive
respectively. Strings of the form C + C + A + B + B
will be analyzed ambiguously with these two rules.

3. The left and right recursive condition may derive from
the combination of a recursive rule and a cycle of rules (of indefinite
length). For example, the three rules (3) [...],
(4) C + A → B, and (5) X + Y → C will provide ambiguous
analyses for strings of the form A + X + Y + A + X + Y + ... + A.
In this case, the label A is left recursive by virtue of the cycle
consisting of rules (3) and (4).

4. A set of non-recursive rules may fulfill the left and
right recursive condition through cycles. For example, the non-
recursive rules (6) [...], (7) A + X → [...], and (8) [...]
provide the label A with both left and right recursiveness through
two cycles consisting of the rules (6) and (7), and (6) and (8).
In this case, strings of the form A + X + Y + A + X + Y + ... + A
will be ambiguously analyzed by the given rules.

[1] This discussion is limited in application to systems with
binary grammars; space does not permit the development of the
fully general case. Specifically excluded from this discussion
is the case of the ambiguous grammar (i.e., a grammar containing
more than one rule for rewriting a given pair of candidate labels).

Other sets of conditions are possible, of course, but these four
examples are sufficiently varied to facilitate a brief discussion
which can easily be generalized. In our present operations,
ambiguous structural descriptions resulting from conditions such
as those in 1., 2., and 3. above are considered trivially different.
More specifically, alternative structures resulting from the
application to a given string of the same set of rules, but in a
different order, are considered trivially different only when the
label dominating the string is both left and right recursive in the
rule set and is recursive in a recursive rule in the applied set.

In 4. above, it will be seen that these conditions are not
met inasmuch as the rules given are all non-recursive. [1] It is
an empirically-based judgment of ours that alternative structural
descriptions occurring under these conditions are potentially more
interesting for purposes of grammatical research than those
covered by the formulation in the preceding paragraph. There is
a formal correlative to this judgment which is illustrated by the
contrast between situations 3. and 4. above. Note that, in both
cases, the sample string for which alternative descriptions are
provided by the rules is the same, namely A + X + ... + A, etc.
This string type may be described as a repetition of the recursive
label with a fixed substring interposed between the repetitions.
Now, in 3., the fixed interposed substring (i.e., X Y) is
reducible by itself to a single structure (i.e., C). This same
substring in 4., however, is not reducible to a single structure
and consequently its relationship with the recursive label will be
more complicated and, quite possibly, more interesting.

Practical considerations have forced the adoption of some
means to keep these trivially different alternative structures
out of the parsing lists. On the one hand, their explosive
combinatorial growth as a function of string length is disastrous
for both available memory space and program running time; on
the other, they result in the production of reams of uninformative
printout and would constitute a serious obstacle to grammatical
research.
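
The growth in question can be made concrete for the simplest case, the single recursive rule A + A → A applied to a string of n items labelled A. The sketch below (the function name `count_trees` is an assumption for illustration) counts the distinct binary structural descriptions of such a string; these counts are the Catalan numbers.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_trees(n: int) -> int:
    """Number of distinct binary trees over n adjacent A's under A + A -> A."""
    if n == 1:
        return 1
    # Split the string into a left part of length k and a right part of n - k.
    return sum(count_trees(k) * count_trees(n - k) for k in range(1, n))


for n in (2, 3, 4, 5, 10):
    print(n, count_trees(n))
# prints: 2 1, 3 2, 4 5, 5 14, 10 4862 (one pair per line)
```

A ten-item string thus already admits 4862 structural descriptions for the full string alone, before the rows for its shorter substrings are counted.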


Several solutions to this kind of problem are available. It
could be solved by eliminating recursive labels from the grammar
altogether, or by eliminating all labels which are both right and
left recursive from the grammar. We regard these solutions as
both practically and theoretically undesirable, because the former
solution, by eliminating recursiveness altogether, leaves a grammar
of finite generative capacity, while the latter would require the
addition to the grammar of a great many rules to maintain its weak
generative capacity. Another kind of solution would involve the
imposition of some order on the applicability of the rules in the
grammar; we have avoided this in order to maintain the
independence of grammar rules from one another as an aid to
experimentation. Finally, because of the unusually stringent
constraints on memory utilization, all solutions requiring the
maintenance of supplementary lists during parsing operations
were ruled out.

[1] In the present grammar, there are no labels which are both
left and right recursive solely by virtue of cycles of non-recursive
rules.

Instead, a toggle-switched optional subroutine was added
to the basic algorithm which recognizes trivial ambiguities in
the making and discards them. This optional node suppression
subroutine fits into the flowchart on page 11 between boxes (26)
and (27). It ascertains, first of all, whether the grammar rule
to be applied is of the recursive type (i.e., in the terminology
of the flowchart, whether A or B is identical with C); if it is
not, processing continues (to box (27)) as usual. If the rule is
recursive, then a check of all parsing list entries of length PL
is made to determine whether any one of these has the same
entries in the INW and GL columns as would the potential new
entry. If this condition is met, the potential new entry is
discarded, and an exit is made to box (29); if this condition is
not met, processing continues as usual. This subroutine may
be represented thus:
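
As a textual stand-in for the check just described, here is a minimal Python sketch. It is not the original subroutine: the function name `suppress_node`, its argument layout, and the string labels in the example are assumptions; only the test itself (recursive rule, then a search of the entries of length PL for a matching INW and GL) follows the description above.

```python
def suppress_node(rule, new_inw, rows_of_length_pl, INW, GL):
    """Return True if the potential new entry should be discarded.

    rule: (A, B, C) for a grammar rule A + B -> C.
    new_inw: INW value the potential new entry would carry.
    rows_of_length_pl: row numbers of existing entries of length PL.
    INW, GL: the corresponding parsing list columns.
    """
    a, b, c = rule
    if a != c and b != c:
        return False  # non-recursive rule: processing continues as usual
    # Recursive rule: look for an existing entry of the same length with
    # the same leftmost position and the same grammatical label.
    return any(INW[r] == new_inw and GL[r] == c for r in rows_of_length_pl)


# Example: row 3 already records an A of length 2 starting at position 1.
INW = [1, 2, 3, 1]
GL = ["A", "A", "A", "A"]
print(suppress_node(("A", "A", "A"), 1, [3], INW, GL))  # prints True
print(suppress_node(("A", "A", "A"), 2, [3], INW, GL))  # prints False
```

Because the search is confined to rows already in the parsing list, the check needs no supplementary lists, in keeping with the memory constraint stated above.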



It may be of interest to note that this technique achieves
precisely the same economies in parsing list reduction as does
that proposed by Jane Robinson of the RAND Corporation (in
Endocentric Constructions and The Cocke Parsing Logic,
the International Conference on Computational Linguistics, New York,
May 1965), but without the overt marking and arrangement of
recursive rules.






2.1.1.3.2 Vector-Symbol Phrase Markers. A second,
and more interesting, modification of the basic algorithm was
adopted to deal with a problem which is particularly severe when
dealing with highly inflected languages such as Russian. The
problem is essentially that a context-free phrase structure grammar
for such a language will contain many subsets of rules, dealing
with agreement and government situations, which closely
resemble the number and gender paradigms presented
in classical grammars. For instance, to link up an adjective
with a noun in a sentence from such a language, an
imposing array of rules, having the following general form, is
required:



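\[
\begin{aligned}
A_{\text{masc., sg., nom.}} + N_{\text{masc., sg., nom.}} &\rightarrow NP_{\text{masc., sg., nom.}} \\
A_{\text{masc., sg., gen.}} + N_{\text{masc., sg., gen.}} &\rightarrow NP_{\text{masc., sg., gen.}} \\
&\;\;\vdots \\
A_{\text{masc., pl., nom.}} + N_{\text{masc., pl., nom.}} &\rightarrow NP_{\text{masc., pl., nom.}} \\
&\;\;\vdots \\
A_{\text{fem., sg., nom.}} + N_{\text{fem., sg., nom.}} &\rightarrow NP_{\text{fem., sg., nom.}} \\
&\;\;\vdots
\end{aligned}
\]

and so on.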

There is something inelegant about such collections of rules 
and, more importantly from a practical point of view, they consume 
a tremendous amount of storage space. Moreover, these sets
of rules constitute an impediment to the kind of research in which
we have been engaged: elementary modifications to the grammar
would often involve the rewriting of entire sets of such rules.

The solution finally adopted was suggested by the label + subscript
notation employed in the illustration above; it consists of splitting
each component of a grammar rule into two parts: a fixed part,
denoting a major grammatical category, and a variable part,
denoting subclasses within the major category.











2.1.1.3.2.1 Numerical Format. Some preliminary
explanation is necessary before the details of this vector-symbol
(VS) arrangement can be presented. It has been mentioned that
the word length of the BR-130 computer is 15 bits, and that the
basic algorithm was initially programmed in FORTRAN, using
integer arithmetic. There was therefore available a range of
non-negative numbers (from zero through 16,383) [1] for the
representation of grammatical labels. In deciding how to
implement a more compact arrangement of grammar rules of the
kind mentioned, the question of how to accomplish the task within
these numerical constraints arose. For a variety of reasons,
we finally decided to employ the units and tens positions of the
numerical grammatical labels for the expression of "paradigmatic
variations", and to employ the three most significant digits as
fixed major category tags. To permit full utilization of the
variable digits, it was, of course, necessary then to restrict
ourselves to the range 00000 through 16299 for the expression of
grammatical labels. The fixed major category tags would then
vary from 000 through 162, while the variable or "suffix" tags
could range from 00 through 99.

Finally, in order to achieve maximum flexibility in the
overall arrangement, the two variable or "suffix" digits were
effectively split off from one another, so that they might be
manipulated independently. This gave the result that each
component of a grammar rule consisted of three segments:
(1) a major grammatical category tag, represented by a non-
negative integer from 000 through 162, (2) a suffix variable,
ranging from 0 through 9, (3) a second suffix variable, ranging
from 0 through 9. The grammatical label 12374 could thus be
thought of as consisting of the segments 123-7-4. The convenience
of such a convention is that it permits one of the suffix variables,
when attached (for example) to the major tag for Noun, to represent
gender and number, while the other suffix tag might represent
case.

[1] Unless explicit mention is made to the contrary, all
numbers in the remainder of this report will be in decimal notation.
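
The segmentation of a numerical label can be sketched with ordinary integer arithmetic, much as it would have been expressed in FORTRAN. The function names `split_label` and `join_label` are assumptions for illustration; the limits 16299 and 16,383 come from the text above (16,383 being the largest non-negative value when one of the 15 bits is taken as a sign bit, i.e. 2^14 - 1).

```python
MAX_LABEL = 16299  # largest label that leaves both suffix digits fully usable


def split_label(label: int):
    """Split a label into (major category tag, first suffix, second suffix)."""
    if not 0 <= label <= MAX_LABEL:
        raise ValueError("label outside the range 00000 through 16299")
    major, suffix = divmod(label, 100)   # three leading digits, two trailing
    first, second = divmod(suffix, 10)   # tens digit, units digit
    return major, first, second


def join_label(major: int, first: int, second: int) -> int:
    """Recombine the three segments into a single numerical label."""
    return major * 100 + first * 10 + second


print(split_label(12374))      # prints (123, 7, 4)
print(join_label(123, 7, 4))   # prints 12374
```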



2.1.1.3.2.2 A Formalism Governing the Application of
Grammatical Variables. In order to take advantage of this schema
to render more compact the list of grammar rules, a formalism
was required which would permit the replacement of a paradigmatic
subset of rules by a single "cover" rule. To illustrate the full
set of conventions finally adopted, the following notation will be
used:



A. Grammar rules will be represented as

   $\underline{A}\,V_{a_1}V_{a_2} + \underline{B}\,V_{b_1}V_{b_2} \rightarrow \underline{C}\,V_{c_1}V_{c_2}$

   where $\underline{A}$, $\underline{B}$, and $\underline{C}$ represent major category
   tags, and the multiply-subscripted V's represent
   the suffix variables appended to them.

B. To represent grammatical labels of neighboring
   nodes (candidates for rewriting), the same
   arrangement will be used, except that "A", "B", and
   "C" replace their underlined counterparts,
   and the symbol "T", with subscripts as above,
   will be used instead of "V"; thus,
   $A\,T_{a_1}T_{a_2} + B\,T_{b_1}T_{b_2}$ will represent a pair
   of adjacent nodes for which an appropriate
   grammar rule is being sought; $C\,T_{c_1}T_{c_2}$
   will represent the grammatical label derived
   from the application of a rule to the foregoing
   pair.
The symbol "x" ranges over a, b; "i" and "j"
range over 1, 2. The formalism has two parts:
the first governs the testing of the applicability
of a given rule to a given candidate-pair:

I.        If $A = \underline{A}$ and $B = \underline{B}$, the rule applies,
          except that

I.a.      $0 < V_{x_i} < 9$ requires that $T_{x_i} = V_{x_i}$

I.b.      $V_{a_i} = V_{b_i} = 0$ requires that $T_{a_i} = T_{b_i}$

The second part of the formalism specifies the composition
of the grammatical label derived from the application of a rule
under the above conditions:

II.a.     $C = \underline{C}$

II.b.1.   $V_{c_i} \neq 0$ requires that $T_{c_i} = V_{c_i}$

II.b.2a.  $V_{c_i} = V_{x_i} = 0$ requires that $T_{c_i} = T_{x_i}$

II.b.2b.  $V_{c_i} = V_{x_j} = 0$ requires that $T_{c_i} = T_{x_j}$
1 J 


2.1.1.3.2.3 Explanation of the Formalism. The net
result of these conventions can be roughly summarized in plain
English as follows:



1) When any digit other than "0" or "9" appears in a
   suffix-variable position of a grammar rule component,
   on the left of the rewrite sign, a match is required
   between that digit and the corresponding digit of the
   appropriate candidate node label.

2) When the digit "9" is used in a suffix-variable position
   of a grammar rule component, on the left of the
   rewrite sign, then no matching at all is required in
   that position.

3) When the digit "0" is used in corresponding suffix
   variable locations in BOTH grammar rule components
   on the left of the rewrite sign, then the corresponding
   two digits in the candidate nodes must be identical;
   otherwise the effect of a variable suffix "0" is the
   same as that of a "9".

4) When the digit "0" is used as a suffix variable in the
   output component of a grammar rule (i.e., on the
   right of the rewrite sign), then it indicates that suffix
   information is to be carried upward from either or
   both of the candidate nodes to the new parent node;
   the location(s) of one or more zeros, in suffix positions
   to the left of the rewrite sign, indicate where the
   information carried upward is to be taken from. The
   existence of "0" in a suffix position of the output
   component of a rule always requires that there be at
   least one "0" in a suffix position on the left of the
   rewrite sign of the same rule.

2.1.1.3.2.4 Examples. To illustrate the usefulness and
clarify the mechanics of these conventions, let us consider two 
hypothetical cases. 

Suppose we need to write a rule covering the government of
a noun (001-) by a preposition (025-). It will be necessary to
determine that the noun is of the proper case. This could be
accomplished by writing a series of rules, one for each case,
gender, and number situation; this was, in fact, required before
the algorithm was modified. If we allow the first suffix
variable with nouns to represent gender and number, and the
second, case, and allow the second suffix variable
for prepositions to represent case also, then a single rule,
02590 + 00100 → 03700, suffices for all instances. Assuming
that a neighboring pair of nodes with labels 02563 and 00173 are
encountered, we can test the applicability of the above rule and,
if applicable, determine the composition of the new parent node
by referring to the formalism given on page 19. To make this
easier, the components of the rule and of the node labels can be
segmented and juxtaposed:

GRAMMAR RULE:    025-9-0    001-0-0    037-0-0

NODE LABELS:     025-6-3    001-7-3    037-7-3



Referring to rule I. of the formalism, we test whether the
major grammatical tags in the rule are identical with those of
the node labels; since they are, we go on to rule I.a., which is
seen to be inapplicable. Rule I.b. applies for i = 2, requiring
that the second suffix variables of the node labels be identical;
this is seen to be the case (both are "3"). This last test amounts
to a test for "agreement" between the case specification of the
preposition and that of the noun; at this point, since no more
test conditions are listed, the rule is found to be applicable.

To determine the composition of the resultant parent node,
we refer first to rule II.a. of the formalism; this enables us to
obtain 037- as the major grammatical tag of the parent node.
Rule II.b.1. is seen to be inapplicable. Rule II.b.2a. applies
for x = b and i = 1, enabling us to fill out the resultant label to
0377-. Rule II.b.2a. also applies for i = 2 and x = (a or b),
enabling us to determine the final digit of the resultant label,
giving 03773 as the final result. Rule II.b.2b. is not considered,
since the task is already complete.



* It may be of interest to note that it has not
been found [...] in the C-F [...]
that would require the [...]
this rule has [...]
for reasons of [...] inadvisable to [...]

It will be noted that were the noun in the above example
not of the proper case (e.g., if it carried the label 00176), then
the rule could not have been applied because of the condition
expressed by rule I.b. of the formalism. Observe also that the
first suffix variable in the preposition's label (i.e., "6") is
ignored altogether because of the "9" in the corresponding position
in the grammar rule. Similarly, the gender-number suffix
variable in the noun's label (i.e., "7") is not considered in
testing the applicability of the rule; the reason for this is given
in 3) of the notes on page 20.

Let us consider one more illustrative example of the
mechanics of the modified algorithm. Suppose we have a
grammar rule: 01205 + 11604 → 06313, and a neighboring
pair of nodes in the developing tree structure with labels:
01265, 11664. Rule I. of the formalism is satisfied. Rule I.a.
holds for x=a, i=2 and also for x=b, i=2; in both cases, it is
satisfied. Rule I.b. holds for i=1, and it, too, is satisfied. The
rule is therefore said to be applicable to the candidate pair.
Rule II.a. tells us that the resultant parent node label will begin
063-. Rule II.b.1 applies for i=1 and also for i=2, yielding a
final resulting node label 06313. In this instance, "agreement"
was required between the candidate node labels in the first
suffix position only; no suffix information is carried upward to
the parent node label.
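
The two worked examples can be traced mechanically. The following is a minimal sketch in Python, written only from the informal notes 2) through 4) and the two examples above; the function name `apply_rule` and the representation of labels as 5-character strings are assumptions made for illustration, and the sketch does not attempt to reproduce rule II.b.2b of the formalism, which is not described in this section.

```python
def apply_rule(rule, left, right):
    """Return the parent label, or None if the rule is not applicable.

    rule  -- (first component, second component, output), 5-digit strings
    left, right -- labels of the neighboring candidate nodes
    """
    a, b, out = rule
    # The major grammatical tags (first three digits) must be identical.
    if a[:3] != left[:3] or b[:3] != right[:3]:
        return None
    parent = out[:3]
    for i in (3, 4):  # the two suffix-variable positions
        # A literal digit (neither "0" nor "9") must equal the node's digit;
        # "9" requires no matching at all.
        for rule_digit, node_digit in ((a[i], left[i]), (b[i], right[i])):
            if rule_digit not in "09" and rule_digit != node_digit:
                return None
        # "0" in the same position of BOTH components demands agreement;
        # a "0" in only one component behaves like a "9".
        if a[i] == "0" and b[i] == "0" and left[i] != right[i]:
            return None
        if out[i] != "0":
            parent += out[i]      # literal output digit
        elif b[i] == "0":
            parent += right[i]    # carried upward from the second node
        elif a[i] == "0":
            parent += left[i]     # carried upward from the first node
        else:
            raise ValueError("output '0' needs a '0' left of the rewrite sign")
    return parent


# The preposition-noun example and the second example from the text.
print(apply_rule(("02590", "00100", "03700"), "02563", "00173"))
print(apply_rule(("01205", "11604", "06313"), "01265", "11664"))
```

When both left-hand components carry a "0" in the same position, the agreement test guarantees the two node digits are equal, so it does not matter which node the digit is taken from.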

2.1.1.3.2.5 Remarks. As complex as the formalism,
the informal explanation, and the examples may make this
modification to the algorithm appear, in practice it becomes
simple in the extreme; and its adoption has resulted in a
grammar that is both greatly reduced in size and considerably
more elegant than it could otherwise be. A beneficial side effect
of this technique has been to greatly enhance the mnemonic value
of individual grammar rules, which eases the interpretation [...]
This is due to the [...] of the grammatical labels and
the [...] that has been heavily exploited in the assignment of
grammatical [...]

In relating this modification of the algorithm to the flowchart,
the subroutines corresponding to rule I of the
formalism should be included in box (26), while those corresponding
to rule II belong to box (28).

One rather different approach to dealing with the problem of
inefficiency inherent in context-free phrase-structure grammar
rules is known to the authors. Briefly, it involves the use of
grammatical variables whose status is tested against stored tables
of conditions which are chained together in a highly flexible manner.
We have not adopted this interesting technique, because its
implementation would tax the memory capacity of the hardware system
at our disposal, and also because it seemed preferable for our
purposes to express the entire significance of each grammar rule
in a single formula; this enhances the mnemonic value of the rules
and imposes a somewhat tighter structure on the grammar itself.

2.1.1.4 Input/Output Conventions

2.1.1.4.1 Grammar Rules, Input and Output. The rules of
the grammar are punched, one rule per card, on standard 80-column
data cards. Each component of a rule is punched as a 5-digit decimal
number, with preceding zeros where required. The resulting 15
digits occupy columns 7 through 21 of the card, the remainder of
the card being irrelevant (i.e., it is ignored by the input subprogram).
The grammar rule 01365 + 12102 → 00321 would be punched
013651210200321, beginning in column 7 of the grammar rule card.
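
As an illustration of this card layout, here is a minimal Python sketch that punches and reads a rule card image. The function names `punch_rule_card` and `parse_rule_card` are hypothetical, and the card is modeled simply as an 80-character string; nothing here is taken from the original input subprogram.

```python
def punch_rule_card(first, second, output):
    """Build an 80-column card image with the rule in columns 7-21."""
    digits = first + second + output
    if len(digits) != 15 or not digits.isdigit():
        raise ValueError("each component must be a 5-digit decimal number")
    return (" " * 6 + digits).ljust(80)


def parse_rule_card(card):
    """Read the three 5-digit components from columns 7-21 of a card.

    Everything outside those columns is ignored, as the text describes.
    """
    field = card.ljust(80)[6:21]  # columns 7..21 are string indices 6..20
    if not field.isdigit():
        raise ValueError("columns 7-21 must contain 15 decimal digits")
    return field[0:5], field[5:10], field[10:15]


card = punch_rule_card("01365", "12102", "00321")
print(card[6:21])
print(parse_rule_card(card))
```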



This is due to Martin Kay of the RAND Corp. Economical
schemes have been developed, also at RAND, for categorial and
dependency grammars, but these deal with a rather distinct set of
problems and are not [...] discussion. The system
proposed by G. H. Harman (in "Generative Grammars Without
Transformation Rules: A Defense of Phrase Structure",
Language, 39, [...]) is very similar to a VSPG as described
here, except that [...]
The cards containing the grammar rules are assembled
manually into a deck which is in numerical ascending order.
In the ordering, only the major grammatical category tags of
the first two components are considered; these are treated as
though they formed a single continuous number from the left.
For purposes of ordering the rule deck, therefore, the card
containing the sample rule above would have a value of 013121.
Consequently, it would appear nearer the top of the deck than a
card bearing the rule 02513 + 16299 → 16100 (whose "ordering
value" is 025162).
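
The ordering value can be computed directly from a rule. The sketch below is illustrative only: `ordering_value` is a hypothetical name, and rules are represented as triples of 5-digit strings. The value is kept as a 6-character string so that it reads exactly as in the text; strings of equal length made of digits sort in the same order as the numbers they denote.

```python
def ordering_value(rule):
    """Concatenate the major tags (first 3 digits) of the first two components."""
    first, second, _output = rule
    return first[:3] + second[:3]


deck = [
    ("02513", "16299", "16100"),
    ("01365", "12102", "00321"),
]
deck.sort(key=ordering_value)  # ascending order, as required for the rule deck
for rule in deck:
    print(ordering_value(rule), rule)
```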

A special card is placed on the bottom of the deck as a
signal to the input subroutine that it has finished its task. When
this card is read, input operations terminate and a special
subroutine is called which performs two functions: (1) it ascertains
that there are no errors in ordering of the rules, and (2) it marks
the boundaries of "families" of grammar rules by converting the
leftmost component of the last-received rule of each family from
a positive to the corresponding negative integer (see examples
in Appendix D).

It is well known that, for lists longer than a fairly small 
threshold size, search operations can be most efficiently 
accomplished by the binary search technique. This technique 
requires that the list to be searched be numerically ordered— 
it is for this reason that the grammar rule deck is ordered in 
the fashion described above. (Box (25) of the flow diagram on 
page 10 indicates the place of the binary grammar search in the 
parsing algorithm.) Because of the method of ordering the
grammar rules, it is possible that several rules will have the
same "ordering value"; such a group of rules constitutes a
"family" of rules (within which the ordering is entirely arbitrary).
It will be noted that a group of rules constituting a family share
the property that if, in a given instance, any one of them satisfies
rule I. of the formalism (page 20), all of them will satisfy it.
For this reason, rules are checked for applicability in family
groups (see box (26) of the parsing flow diagram, page 10). The
purpose of the binary search, then, is to locate appropriate
families of rules.

Since the ordering of rules within a family is arbitrary, a
simple procedure is required which will guarantee that none of
its members is overlooked in testing. In other words, it is
necessary to be able to identify the boundaries of family groups
in the rule list. By marking these boundaries in the manner
described above, this is accomplished with an irreducibly minimal
expenditure of running time.
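
The family-marking and lookup scheme described above might be sketched as follows. This is an assumption-laden illustration rather than the original subroutine: rules are held as triples of integers, the names `mark_families` and `find_family` are hypothetical, the second rule in the sample deck is invented so that one family has two members, and the search shown locates the first rule of a family and then scans forward to the negative boundary marker.

```python
def family_key(rule):
    """Major tags of the first two components (sign of the first is ignored)."""
    first, second, _output = rule
    return (abs(first) // 100, second // 100)


def mark_families(rules):
    """Check the ordering and negate the first component of each family's last rule."""
    marked = []
    for i, (first, second, output) in enumerate(rules):
        if i > 0 and family_key(rules[i - 1]) > family_key(rules[i]):
            raise ValueError("rule deck is out of order at rule %d" % (i + 1))
        is_last = i + 1 == len(rules) or family_key(rules[i + 1]) != family_key(rules[i])
        marked.append((-first if is_last else first, second, output))
    return marked


def find_family(marked, left_tag, right_tag):
    """Binary search for a family, then collect it up to the boundary marker."""
    target = (left_tag, right_tag)
    lo, hi = 0, len(marked)
    while lo < hi:  # find the first rule whose key is >= target
        mid = (lo + hi) // 2
        if family_key(marked[mid]) < target:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(marked) or family_key(marked[lo]) != target:
        return []
    family = []
    while True:
        family.append(marked[lo])
        if marked[lo][0] < 0:  # negative first component: end of the family
            return family
        lo += 1


# (1390, 12199, 400) is a made-up rule sharing the ordering value 013121.
deck = mark_families([(1365, 12102, 321), (1390, 12199, 400), (2513, 16299, 16100)])
print(find_family(deck, 13, 121))
```

Because the last rule of the deck always closes a family, the forward scan is guaranteed to stop at a negative marker without any further comparison of tags.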

The grammar rule list is output via the line printer in the 
form shown in Appendix D. 

2.1.1.4.2 Sentence Input and Output. Sentences to be parsed
are keypunched, one "word" per card, on standard 80-column data
cards. Columns 1 through 24 of the card contain an alphanumeric
transliteration of the Russian "word" (see Progress Report No. 6,
page A-14). Columns 25 and 26 contain a two-digit decimal number,
with preceding zero where necessary, specifying the number of
distinct grammatical labels assigned the input item; this number
may range from 01 through 20 (it corresponds to the subscript h
in box (3) of the input flow diagram). Column 27 is left blank.
The remaining columns, from 28 through 72, in fields of 5 columns,
contain up to nine distinct grammatical labels. If the "word" has
been assigned more than 9 such labels, the remainder are punched
on a second card, in fields of 5 columns, from column 7 through
column 61. A typical input "word" card appears as follows:










FORMAT:

CONTENTS:  transliteration   number      1st     2nd     ...   5th
                             of labels   label   label         label

COLUMN:    1                 25          28      33   38   43  48-52

[Sample card for the input "word" PARTII with five labels; the
punched field contents are not recoverable.]
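
The column layout described for input "word" cards can be sketched as a small parser. The following Python is illustrative only: the function name `parse_word_cards` is hypothetical, cards are modeled as strings, and the label values used in the example are invented for the demonstration rather than read from the sample card.

```python
def parse_word_cards(first, continuation=None):
    """Return (transliteration, labels) from one or two card images.

    first        -- card with the word in columns 1-24, the label count in
                    columns 25-26, and up to nine labels in columns 28-72
    continuation -- optional second card with labels in columns 7-61
    """
    first = first.ljust(80)
    word = first[0:24].rstrip()
    count = int(first[24:26])
    if not 1 <= count <= 20:
        raise ValueError("label count must be between 01 and 20")
    # Nine 5-column fields starting at column 28 (string index 27).
    fields = [first[c:c + 5] for c in range(27, 72, 5)]
    if count > 9:
        if continuation is None:
            raise ValueError("more than 9 labels requires a second card")
        continuation = continuation.ljust(80)
        # Eleven 5-column fields starting at column 7 (string index 6).
        fields += [continuation[c:c + 5] for c in range(6, 61, 5)]
    return word, fields[:count]


# Hypothetical card: the word PARTII with two made-up labels.
card = "PARTII".ljust(24) + "02" + " " + "00122" + "00141"
print(parse_word_cards(card))
```

Nine fields on the first card plus eleven on the second give the stated maximum of twenty labels per input item.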



An input "word" need not necessarily correspond, it must
be pointed out, to an ordinary Russian text word, as in the above
example. We have defined the notion "input word" to include
anything which (1) can be determined by purely mechanical means
from the Russian text, and (2) is most conveniently handled in
the grammar as a separate entity. This definition has, in the
course of our research, included such things as ordinary text
words, punctuation marks (comma, period, dash), initials and names
("USA", "N", "S", Khrushchev, etc.), and artificial elements
("sentence-begin", "comma-follower", etc.) such as are discussed
in section 2.1.2 of this report.

The individual input cards for a sentence are assembled into
a sentence input deck in natural text order. A special "end-of-input"
card is added to this deck to trigger the transfer from the input
subroutine to the parsing algorithm proper (see box (4) of the
input flow diagram). As each input card is read into the computer,
the grammatical labels it contains are arranged (as shown in
diagram 2.1.1.2.2) in the initial parsing array. At the same
time, the input items are numbered and printed via the line printer
in the form shown in the appendix. The alphanumeric transliterations
are also stored where they can be later recalled for use in printing
out the results of the parsing operations. Needless to say, these
play no part whatsoever in the parsing procedures [...]
data upon which a dictionary-search subprogram [...]
environment, and they serve to render [...]
2.1.1.4.3 Output of Parsing Results. Several kinds of
displays of parsing results can be selected by means of toggle-switch
settings. One of these is a display of the
initial parsing array as prepared by the input subprogram
(diagram 2.1.1.2.2). Appendix B, pages B-2 and B-5, shows
a line printer display of the final parsing array after the parsing
algorithm has terminated. It is from this final parsing array,
along with the stored alphanumeric transcriptions of the input
text, that the remaining displays are derived.

Appendix B shows a sample of a line printer display which
depicts a tree structure derived from the input. For each
non-terminal node in the tree, the parsing array row number and the
grammatical label are printed at the left; at the right, the row
numbers, labels, and corresponding substrings of the daughter
nodes are given. To illustrate, the following tree, labeled "A",
would be represented as in diagram "B" below:
It is also possible to obtain a line printer display of
well-formed [...] derived from the input. Normally, the subroutine
which prepares this [...] out in succession all well-formed
[...], beginning with two-[...]
entries in the parsing array. By
appropriate typewriter-input commands, however, selective displays
can be generated on the basis of phrase length or phrase label or both.
2.1.2 The Vector-Symbol Phrase Grammar

The grammatical labels are given in Appendix A. They
represent both the basic grammar codes (the results of the
pseudo-dictionary lookup), as well as the higher-level labels
assigned as the result of applying the rules to the basic grammar
codes (e.g., participial phrase, sentence, etc.). The same
labels in many cases may refer either to basic grammar codes
or to higher-level constructs (e.g., a noun by itself may constitute
a noun phrase).

The variables, A and B, as shown on the first page of
Appendix A, are used to indicate number/gender/person or case.
The values for each of the digits which may be used in the
variable positions are given (recalling that 9 and 0 are special
symbols used in the rules, but never assigned as grammatical
labels). The subscripts "a" and "g" indicate whether the
variable takes part in an agreement or a government relationship,
respectively.

The method of dealing with so-called "homographs" within 
this system cannot be stressed enough. Homographs are marked 
by more than one grammar code, the maximum number of codes 
being twenty. From the codes provided, it can be seen that some 
morphological types which very regularly take on a number of 
different syntactic functions are no less homographs for the 
purposes of this grammar than are many "accidental" homographs.
Thus, for example, "сел" (third singular past of "sit" and
genitive plural of "village") is not more a homograph (indeed
less so) than "новой" (feminine genitive, dative, instrumental
and prepositional of "new"). These alternatives can either be
incorporated into viable structures accounting for the entire
sentence or they cannot; if more than one of the grammar codes
can be incorporated (or the same code incorporated in more than
one significantly different way), the sentence is grammatically
ambiguous; if only one of the grammar codes of a given word
may be employed in a construct leading to a sentence derivation,
its homography may be said to have been resolved. Homographs
are thus not to be looked upon as "special cases", but simply
as the normal material upon which the parsing algorithm operates.

Of course, it is always possible to reduce the number of 
alternate grammar codes at the cost of increasing the number of 
rules. Thus, for example, each unique combination of grammar 
codes assignable to some class of Russian words (including classes 
of only one member) could be given a unique grammar code. 

This would, however, greatly reduce the generality of the grammar 
and make grammar writing a prodigious task. No attempt has 
therefore been made to "telescope" the entries having multiple
grammar codes except in a few cases.

The grammar codes given herein are not complete nor 
even necessarily definitive insofar as they go. They do not, 
for example, account for some cases of dual government (e.g. , 
a short form passive participle taking two instrumental objects) 
nor are nouns taking nominal complements other than genitive 
considered. The grammar codes thus far created are adequate 
for the coding of the vast majority of the Russian sentences we 
encounter. One of the major advantages of such a system is the 
ability to create new grammar codes pro re nata and immediately 
incorporate them in rules in the grammar. 

The Rules 

Appendix D gives the 318 rules currently in use. They are, 
generally, quite readily interpretable though some patience may 
be required. 

A few matters, however, deserve further attention. In
general, a sentence is recognized as: sentence begin symbol,
legal sentence tree, sentence terminating symbol. Of course,
since the algorithm is binary, the recognition of these components
must be done in two steps. The reason for incorporating a symbol
to indicate the beginning of a sentence is that some Russian
structures occur only at the beginning of sentences or after commas
(e.g., gerund clauses). If the beginning of the sentence were not
indicated, the rules would have to be formulated in such a way as
to accept such structures whether or not they were preceded by
a symbol indicating their left boundary (since they might occur
sentence-initially), which could lead to undesirable results.

In connection with this, another problem arises: that of the
dual functioning of Russian punctuation. There are many Russian
constructions which are obligatorily marked as to their beginning
and ending (e.g., gerund and relative clauses). Unfortunately,
one mark of punctuation may serve to both end one construction
and initiate another or to end two such constructions simultaneously.
Consider, for example, a relative clause occurring at the end of
a sentence and obligatorily surrounded by punctuation indicating
its beginning and end. In this case, however, the period that
marks the end of the relative clause also marks the end of the
sentence.

Several alternatives come to mind for the solution of such 
difficulties; three will be discussed below. 

First, the rules may be so formulated as to accept such
structures whether they are preceded and/or followed by overt 
boundary markers or not. Thus, boundary markers serving to 
delimit two or more structures simultaneously could unambiguously 
be assigned to one or the other in the course of a sentence 
derivation (caution must, of course, be exercised in framing 
the rules to insure that multiple syntactic interpretations hinging 
solely on which structures the multi-functioning boundary markers 
are assigned to, are not derived) . 

An immediate objection to such an expedient is that there is
no way of knowing in advance whether, under certain specified
circumstances, ignoring a boundary marker (which is the effect
of framing the rules in this manner) will result in multiple
analyses, some of which could have been avoided by taking the
boundary indicator in question into consideration. In an early
version of the grammar, this, in fact, occurred when an incorrect
derivation was assigned to one of the test sentences as a result
of ignoring a comma ending a relative clause.

In general, it is undesirable for a parsing grammar to rely
on the non-occurrence of certain (presumably) ill-formed strings.
Aside from weakening the grammar's ability to recognize ill-formed
input, presumptions about ill-formedness are often ill-conceived,
since rarely are all the ramifications of such a decision apparent
from the outset.

Alternative 2 requires the inputting of a "dummy" word
following each potential boundary marker. This dummy element
receives the same grammatical classification as the potential
marker (and thus, of course, is a potential boundary marker
itself). Rules are included for combining dummy elements and
boundary markers to produce a single boundary indicator to handle
those cases in which such elements (e.g., comma, period) do not
serve multiple functions.

This option was, in fact, exercised for a while with generally 
gratifying results. It is easy enough to see, however, that when 
the number of structures obligatorily initiated or concluded by a 
single mark of punctuation reaches three, the technique is 
inadequate and would have to be augmented by the method to be 
described subsequently (i.e. , that of "carrying up" grammatical 
information to higher level structures) . While few sentences having 
punctuation serving the functions described have been encountered, 
they must nevertheless be taken into account. The technique was 
therefore abandoned for the one described below. 

The third alternative is to formulate the rules in such a way
that information about the termination of lower level structures
is carried upward to be used in determining whether the same
boundary marker may also indicate the termination of a higher
level structure.

Consider the parsing for the sentence illustrated on pages
C-22 and C-23 in Appendix C. The sentence ends with a participial
phrase which requires a mark of punctuation to terminate it. In
this case, the period terminates both the participial phrase and the
sentence. As can be seen from the tree structure, the sentence
end information was carried up to the noun phrase object, the
verb phrase, and ultimately the sentence. This solution, though
altogether adequate and now functioning reasonably reliably, is
not so elegant as one might wish, since it requires the creation
of nodes for both sentence-terminating and non-sentence-terminating
noun phrases, verb phrases, etc. The problem can be solved in
a general and somewhat simpler way by increasing the number of
variables or by allowing context-sensitive rules. It is anticipated
that in some of our future experiments either or both of these
possibilities will be explored.




3. CONCLUSIONS 



Our primary concern has been the creation of a formalized
sentence analysis system capable of describing Russian sentences
adequately for machine translation, information retrieval, and
related data processing operations. Our initial experiments were
based upon a binary, context-free, phrase-structure grammar,
chosen because of the simplicity of the companion algorithm and
of the form of the grammar itself. We did not, of course, believe
that such a system would be adequate to our main task; rather,
we felt it would be useful to explore its weaknesses in some detail
as a preliminary to the design of a more suitable system. These
weaknesses are much discussed in the literature from a general
linguistic point of view. Our exploratory study was undertaken
to develop empirical results that would help to sharpen the
relevance of such general discussions for our own rather limited
data processing applications.

The two most serious deficiencies we encountered were,
predictably, the multiplicity of grammar rules required to account
for agreement and government situations, and the proliferation of
trivially different structural descriptions assigned to even simple
input strings. Both of these problems might be dealt with up to
a point, at least for our purposes, by allowing the grammar to grow
to the very limits of practicality. But a solution of this kind is
not only uninteresting, it is quite hopeless as well, in view of
the burden it places on the research grammarian and the computer's
internal storage capacity.

To overcome these difficulties, the vector-symbol phrase
grammar, with node suppression, was designed and incorporated
in the parsing system. The "abbreviative" conventions for
collapsing large sets of PS rules into small sets of VS rules have
made it possible to undertake seriously the writing of a phrase
grammar for written Russian suitable for our data processing
applications. In our opinion, the practical potential of this type
of grammar, whose appeal lies in its basic simplicity and in the
straightforwardness of its implementation, has not been sufficiently
explored or exploited. The rapidity with which it has been possible
to develop the grammar described in this report is felt to be one
justification of this position. The grammar is far from "complete",
except as measured against most other operating formal grammars
of Russian text; it does, however, deal in a general fashion with
a broad range of Russian sentence types, and no major intrinsic
impediments to its continued evolution have been encountered.

Appendix B gives computer output of parsings for two sample
sentences. The English has been typed on the examples to assist
those unfamiliar with Russian in following the parsing. The
parsing list is provided as well as the equivalent of a tree diagram
for each sentence.

Appendix C gives sample sentences and their parsings,
illustrating a variety of Russian sentence types. For convenience,
the trees have been redrawn from the computer output.
Interpretations for the labels on the nodes may be found in Appendix A.
Interested readers can refer to the rules in Appendix D to see
the way in which they were applied to the sample sentences.
GRAMMATICAL LABELS

Significance of Variables

A. Number/gender or person

   1. Masculine
   2. Feminine
   3. Neuter
   4. Plural
   5. First person singular
   6. Second person singular
   7. First person plural
   8. Second person plural

B. Case

   1. Nominative
   2. Genitive
   3. Dative
   4. Accusative
   5. Instrumental
   6. Prepositional
   7. Accusative animate

Subscripts

   a. Agreement
   g. Government
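
As a reading aid, the variable values listed above can be applied to a label with a small lookup. The Python below is a sketch under two assumptions that go beyond the text: that the line-printer listings in Appendix B omit leading zeros (so that 135 stands for 00135), and that the label being decoded is one whose fourth and fifth digits are the variables A and B, as for the nominals. The function name `decode_label` is hypothetical.

```python
NUMBER_GENDER_PERSON = {
    1: "masculine", 2: "feminine", 3: "neuter", 4: "plural",
    5: "first person singular", 6: "second person singular",
    7: "first person plural", 8: "second person plural",
}
CASE = {
    1: "nominative", 2: "genitive", 3: "dative", 4: "accusative",
    5: "instrumental", 6: "prepositional", 7: "accusative animate",
}


def decode_label(label):
    """Split a label into (major tag, value of variable A, value of variable B).

    Only meaningful for labels defined with A in the fourth digit and B in
    the fifth; digits without a listed value are returned as None.
    """
    digits = str(label).zfill(5)  # restore leading zeros dropped in printouts
    tag, a, b = digits[:3], int(digits[3]), int(digits[4])
    return tag, NUMBER_GENDER_PERSON.get(a), CASE.get(b)


# 135 is the label printed for PRAVITEL*STVOM in the Appendix B listing.
print(decode_label(135))
```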



We are indebted to Warren Joseph Plath's Harvard doctoral thesis
for several of our syntactic classes. (Mathematical Linguistics and
Automatic Translation, Report No. NSF-12 to the National Science
Foundation, Anthony G. Oettinger, Principal Investigator, Cambridge,
Massachusetts, June, 1963.)
PREDICATIVES

 1)  002A_aB_g    Finite, transitive, personal
 2)  002A_a8      Verb phrase with object
 3)  012A_a8      Finite, intransitive, personal
 4)  022A_aB_g    Impersonal
 5)  032A_aB_g    Short form adjective
 6)  042A_aB_g    Future form of BYT' (to be)
 7)  052A_aB_g    Past form of BYT' (to be)
 8)  062A_aB_g    Short form comparative adjective
 9)  072A_aB_g    Short form participle
10)  082A_aB_g    Short form, infinitive government
11)  092A_aB_g    Verb phrase with a period
12)  102A_aB_g    Past participle with nominal clause subject
13)  112A_aB_g    Impersonal verb with infinitive subject
14)  122A_aB_g    Personal verb with nominal clause direct object
15)  132A_aB_g    Personal verb with infinitive direct object
16)  142A_aB_g    Personal verb with dative object and object
                  in another case (as specified by the variable)
17)  152A_aB_g    Modal auxiliary



NOMINALS

Values for variables

   4th octal digit   Number/gender or person
   5th octal digit   Case

 1)  001A_aB_a    Noun taking adjectives and a genitive
                  nominal complement
 2)  021A_aB_a    Surname
 3)  031A_aB_a    Title (e.g., GENERAL)
 4)  051A_aB_a    Coordinated noun phrase
 5)  061A_aB_a    "SHTO" (as nominal)
 6)  0068B_a      Coordinated conjunction plus nominal
 7)  00676        Name of month in genitive singular
 8)  071A_aB_a    Unmodifiable nominal (e.g., pronouns)
 9)  091A_aB_a    Noun phrase with a period

INFINITIVES

 1)  0048B_g      Infinitive taking one object
 2)  00488        Infinitive phrase with object(s) (or intransitive)
 3)  004B_gB_g    Infinitive taking two objects

GERUND

 1)  00588        Gerund phrase with object(s) (or intransitive)
 2)  0058B_g      Gerund taking one object
 3)  004B_gB_g    Gerund taking two objects
 4)  00578        Gerund phrase plus comma



ADVERBS AND SPECIAL WORDS, PHRASES

 1)  00711        Adverb of manner
 2)  00712        Adverb of time
 3)  063A_aB_a    Relative pronoun
 4)  00777        Subjunctive particle "BY"
 5)  01711        Single capitalized Cyrillic character
 6)  01811        Capitalized character plus period
 7)  00733        Negative "NET"









PREPOSITIONAL PHRASES

 1)  0088B        Preposition
 2)  0188B_g      Prepositional phrase
 3)  01888        [...] + prepositional phrase
 4)  0288B        Prepositional phrase + period
 5)  0788B        Prepositional phrase + dash

RELATIVE CLAUSES

 1)  063A_aB_a    Relative pronoun
 2)  064A_a8      Relative pronoun and verb phrase
                  (relative clauses)
 3)  074A_a8      Relative clause plus period
 4)  065A_a8      Relative clause plus comma
 5)  063A_a8      Comma plus relative clause plus comma
 6)  084A_a8      Comma plus relative clause plus period



PUNCTUATION

 1)  00611        Comma
 2)  00621        Coordinate conjunction
 3)  00631        Period
 4)  00641        Exclamation point
 5)  00651        Sentence begin symbol
 6)  00661
 7)  0068B_a      Coordinating conjunction (or comma)
                  plus noun phrase
 8)  00753        "SHTO" (as conjunction)
 9)  00715        Dash
10)  0077B_a      Dash plus noun phrase
11)  00716        Dash plus nominative noun phrase
12)  00722        Left quote
13)  00723        Right quote
14)  041A_aB_a    Left quote plus nominal




CLAUSES AND SENTENCES

 1)  12345        Sentence begin symbol + sentence + period
 2)  9988         Regular sentence (subject predicate) +
                  period
 3)  9987         Inverted sentence (predicate subject) +
                  period
 4)  9888         Regular sentence (see 2)
 5)  9887         Inverted sentence (see 3)
 6)  985B_g       Transitive verb (without object) + subject
 7)  986B_g       Subject + transitive verb (without object)
 8)  00854        "SHTO" clause plus period
 9)  00855        Comma plus "SHTO" clause plus period
10)  00753        "SHTO" (introduces nominal clauses)



ADJECTIVES AND PARTICIPLES

Values for variables: as for nominals

 1)  003A_aB_a    Adjective
 2)  023A_aB_a    Participle, genitive government
 3)  033A_aB_a    Participle, dative government
 4)  043A_aB_a    Participle, accusative government
 5)  053A_aB_a    Participle, instrumental government
 6)  073A_aB_a    Participle, accusative animate
 7)  083A_aB_a    Participle, intransitive or transitive
                  participle plus object
 8)  037A_aB_a    Unmodifiable adjective (e.g., NA)IJ)
 9)  038A_aB_a    Comma + participial phrase
10)  093A_aB_a    Comma + participial phrase + period
11)  0965B_a      Comma + participial phrase + comma
12)  00672        Numeral requiring genitive plural nominal
13)  00675        Numeral less than 31 (combines with
                  months to form adverbials of time)
 1                     651
 2   KROME             882                        Besides (Further...)
 3   TOGO              3712 3732 7112             that
 4   NAS               7147 7146                  us
 5   VOZMUWALA         224 227                    revolted
 6   OSUWESTVL=EMA=    5321                       practices
 7   PRAVITEL*STVOM    135                        (by) government
 8   S>A               141 142 143 144 145 146    USA
 9   SISTEMA           121                        System
10   PEREXVATA         112                        (of) interception
11   I                 621                        and
12   DE)IFROVANI=      132 141 144                decoding
13   SEKRETNYX         342 346 347                (of) secret
14   SOOBWENIJ         142                        messages
15   SVOIX             3742 3746 3747             (of) its
16   SOQZNIKOV         142 147                    allies
17                     631                        .

END OF SENTENCE

PAIRS = 506          RULES TESTED = 443
LEFT MATCHES = 275   FULL MATCHES = 155
HIGHEST STRUCTURE = 99. / 1) / 12345

NON-TERMINAL NODES = 66
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FINAL PARSING LIST 






  1.    1     651     1     0 
  2.    2     882     2     0 
  3.    3    3712     3     0 
  4.    3    3732     4     0 
  5.    3    7112     5     0 
  6.    4    7147     6     0 
  7.    4    7146     7     0 
  8.    5     224     8     0 
  9.    5     227     9     0 
 10.    6    5321    10     0 
 11.    7     135    11     0 
 12.    8     141    12     0 
 13.    8     142    13     0 
 14.    8     143    14     0 
 15.    8     144    15     0 
 16.    8     145    16     0 
 17.    8     146    17     0 
 18.    9     121    18     0 
 19.   10     112    19     0 
 20.   11     621    20     0 
 21.   12     132    21     0 
 22.   12     141    22     0 
 23.   12     144    23     0 
 24.   13     342    24     0 
 25.   13     346    25     0 
 26.   13     347    26     0 
 27.   14     142    27     0 
 28.   15    3742    28     0 
 29.   15    3746    29     0 
 30.   15    3747    30     0 
 31.   16     142    31     0 
 32.   16     147    32     0 
 33.   17     631    33     0 
 34.    2    1882     2     5 
 35.    4     228     6     9 
 36.    6    8321    10    11 
 37.    7     135    11    13 
 38.    9     121    18    19 
 39.   11     682    20    21 
 40.   11     681    20    22 
 41.   11     684    20    23 
 42.   13     142    24    27 
 43.   15    7142    28    31 
 44.   15    7147    30    32 
 45.    1    1888     1    34 
 46.    6    8321    10    37 
 47.   10    5142    19    39 
 48.   12     132    21    42 
 49.   12     141    22    42 
 50.   12     144    23    42 
 51.   14     142    27    43 
 52.    9     121    18    47 
 53.   11     682    20    48 
 54.   11     681    20    49 
 55.   11     684    20    50 




 56.   13     142    24    51 
 57.    9    5141    38    40 
 58.    6     121    46    18 
 59.    5    9854     8    58 
 60.    5    9857     9    58 
 61.   10    5142    19    55 
 62.   12     132    21    56 
 63.   12     141    22    56 
 64.   12     144    23    56 
 65.    6     121    46    38 
 66.    5    9854     8    65 
 67.    5    9857     9    65 
 68.    9     121    18    61 
 69.   11     682    20    62 
 70.   11     681    20    63 
 71.   11     684    20    64 
 72.    4    9858    35    58 
 73.    4    9887    35    58 
 74.    9    5141    38    54 
 75.   10    5142    19    69 
 76.    4    9858    35    65 
 77.    4    9887    35    65 
 78.    6     121    46    52 
 79.    6    5141    65    40 
 80.    5    9854     8    78 
 81.    5    9857     9    78 
 82.    9     121    18    75 
 83.    9    5141    38    70 
 84.    4    9858    35    78 
 85.    4    9887    35    78 
 86.    6     121    46    68 
 87.    6    5141    65    54 
 88.    5    9854     8    86 
 89.    5    9857     9    86 
 90.    4    9858    35    86 
 91.    4    9887    35    86 
 92.    6     121    46    82 
 93.    6    5141    65    70 
 94.    5    9854     8    92 
 95.    5    9857     9    92 
 96.    4    9858    35    92 
 97.    4    9887    35    92 
 98.    4    9987    97    33 
 99.    1   12345    45    98 








CONSTRUCT / LABEL   COMPONENTS / LABEL   PHRASE 

99. 12345           45. 1888             KROME TOGO 
                                         Besides that (Furthermore) 
                    98. 9987             NAS VOZMUWALA OSUWESTVL=EMA= 
                                         PRAVITEL*STVOM S)A SISTEMA PEREXVATA 
                                         I DE)IFROVANI= SEKRETNYX SOOBWENIJ 
                                         SVOIX SOQZNIKOV 
                                         us revolted practiced (by) government 
                                         USA System (of) interception and 
                                         decoding (of) secret messages (of) 
                                         its allies. 

45. 1888             1. 651 
                    34. 1882             KROME TOGO 
                                         Besides that (Furthermore) 

98. 9987            97. 9887             NAS VOZMUWALA OSUWESTVL=EMA= 
                                         PRAVITEL*STVOM S)A SISTEMA PEREXVATA 
                                         I DE)IFROVANI= SEKRETNYX SOOBWENIJ 
                                         SVOIX SOQZNIKOV 
                                         us revolted practiced (by) government 
                                         USA System (of) interception and 
                                         decoding (of) secret messages (of) 
                                         its allies 
                    33. 631 

34. 1882             2. 882              KROME 
                                         Besides 
                     5. 7112             TOGO 
                                         that 

97. 9887            35. 228              NAS VOZMUWALA 
                                         us revolted 
                    92. 121              OSUWESTVL=EMA= PRAVITEL*STVOM S)A 
                                         SISTEMA PEREXVATA I DE)IFROVANI= 
                                         SEKRETNYX SOOBWENIJ SVOIX SOQZNIKOV 
                                         practiced (by) government USA System 
                                         (of) interception and decoding (of) 
                                         secret messages (of) its allies 

35. 228              6. 7147             NAS 
                                         us 
                     9. 227              VOZMUWALA 
                                         revolted 

92. 121             46. 8321             OSUWESTVL=EMA= PRAVITEL*STVOM S)A 
                                         practiced (by) government USA 
                    82. 121              SISTEMA PEREXVATA I DE)IFROVANI= 
                                         SEKRETNYX SOOBWENIJ SVOIX SOQZNIKOV 
                                         System (of) interception and decoding 
                                         (of) secret messages (of) its allies 

46. 8321            10. 5321             OSUWESTVL=EMA= 
                                         practiced 
                    37. 135              PRAVITEL*STVOM S)A 
                                         (by) government USA 

82. 121             18. 121              SISTEMA 
                                         System 
                    75. 5142             PEREXVATA I DE)IFROVANI= SEKRETNYX 
                                         SOOBWENIJ SVOIX SOQZNIKOV 
                                         (of) interception and decoding (of) 
                                         secret messages (of) its allies 

37. 135             11. 135              PRAVITEL*STVOM 
                                         (by) government 
                    13. 142              S)A 
                                         USA 

75. 5142            19. 112              PEREXVATA 
                                         (of) interception 
                    69. 682              I DE)IFROVANI= SEKRETNYX SOOBWENIJ 
                                         SVOIX SOQZNIKOV 
                                         and decoding (of) secret messages 
                                         (of) its allies 

69. 682             20. 621              I 
                                         and 
                    62. 132              DE)IFROVANI= SEKRETNYX SOOBWENIJ 
                                         SVOIX SOQZNIKOV 
                                         decoding (of) secret messages (of) 
                                         its allies 

62. 132             21. 132              DE)IFROVANI= 
                                         decoding 
                    56. 142              SEKRETNYX SOOBWENIJ SVOIX SOQZNIKOV 
                                         (of) secret messages (of) its allies 

56. 142             24. 342              SEKRETNYX 
                                         (of) secret 
                    51. 142              SOOBWENIJ SVOIX SOQZNIKOV 
                                         messages (of) its allies 

51. 142             27. 142              SOOBWENIJ 
                                         messages 
                    43. 7142             SVOIX SOQZNIKOV 
                                         (of) its allies 

43. 7142            28. 3742             SVOIX 
                                         (of) its 
                    31. 142              SOQZNIKOV 
                                         allies 



END OF STRUCTURE 
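
The relation between the FINAL PARSING LIST and the structure printout can be sketched in a few lines. The sketch is an illustration only and not the original program. It reads the four numbers after each entry number as word position, grammar code, left component and right component, with a right component of 0 marking a dictionary reading; this reading is inferred from the printout and is an assumption. The names `ENTRIES`, `WORDS`, `phrase` and `structure` are hypothetical, and only the entries under construct 75 are copied in.

```python
# Entries copied from the FINAL PARSING LIST for the phrase under entry 75.
# ASSUMED layout: number -> (word position, code, left component, right component);
# a right component of 0 is taken to mark a single-word (dictionary) reading.
ENTRIES = {
    19: (10, 112, 19, 0),
    20: (11, 621, 20, 0),
    21: (12, 132, 21, 0),
    24: (13, 342, 24, 0),
    27: (14, 142, 27, 0),
    28: (15, 3742, 28, 0),
    31: (16, 142, 31, 0),
    43: (15, 7142, 28, 31),
    51: (14, 142, 27, 43),
    56: (13, 142, 24, 51),
    62: (12, 132, 21, 56),
    69: (11, 682, 20, 62),
    75: (10, 5142, 19, 69),
}

WORDS = {10: "PEREXVATA", 11: "I", 12: "DE)IFROVANI=", 13: "SEKRETNYX",
         14: "SOOBWENIJ", 15: "SVOIX", 16: "SOQZNIKOV"}


def phrase(number):
    """Words spanned by an entry, collected left to right."""
    word, _code, left, right = ENTRIES[number]
    if right == 0:
        return [WORDS[word]]
    return phrase(left) + phrase(right)


def structure(number):
    """List each construct followed by its two components and their phrases."""
    lines = []
    pending = [number]
    while pending:
        n = pending.pop(0)
        _word, code, left, right = ENTRIES[n]
        if right == 0:
            continue                      # single words have no components
        lines.append(f"{n}. {code}")
        for c in (left, right):
            lines.append(f"    {c}. {ENTRIES[c][1]}  {' '.join(phrase(c))}")
            pending.append(c)
    return lines


print("\n".join(structure(75)))
```

Starting from entry 75, the sketch visits constructs 75, 69, 62, 56, 51 and 43 in turn, the same chain of components that the structure printout shows for this phrase.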













 1)                     631 
 2)  SEGODN=            712                    Today 
 3)  NA)A               3721                   our 
 4)  PARTI=             121                    party 
 5)  ,                  611 
 6)  VES*               3711 3714 7111 7114    all 
 7)  SOVETSKIJ          311 314                Soviet 
 8)  NAROD              111 114                people 
 9)  ,                  611 
10)  PROGRESSIVNA=      321                    progressive 
11)  OBWESTVENNOST*     121 124                society 
12)  MIRA               112                    (of) world 
13)  OTME(AQT           244                    mark 
14)  50-LETIE           134 131                50th anniversary 
15)  GAZETY             141 144 122            (of) newspaper 
16)  ((                 722 
17)  PRAVDA             121                    "Pravda" 
18) 
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